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Abstract 



High-dimensional phenotypes hold promise for richer findings in asso- 
ciation studies, but testing of several phenotype traits aggravates the grand 
challenge of association studies, that of multiple testing. Several methods 
have recently been proposed for testing jointly all traits in a high-dimensional 
vector of phenotypes, with prospect of increased power to detect small ef- 
fects that would be missed if tested individually. However, the methods have 
rarely been compared to the extent of enabling assessment of their relative 
merits and setting up guidelines on which method to use, and how to use it. 
We compare the methods on simulated data and with a real metabolomics 
data set comprising 137 highly correlated variables and approximately 550,000 
SNPs. Applying the methods to genome-wide data with hundreds of thou- 
sands of markers inevitably requires division of the problem into manage- 
able parts facilitating parallel processing, parts corresponding to individual 
genetic variants, pathways, or genes, for example. Here we utilize a straight- 
forward formulation according to which the genome is divided into blocks 
of nearby correlated genetic markers, tested jointly for association with the 
phenotypes. This formulation is computationally feasible, reduces the num- 
ber of tests, and lets the methods take advantage of combining information 
over several correlated variables not only on the phenotype side, but also on 
the genotype side. Our experiments show that canonical correlation analysis 
has higher power than alternative methods, while remaining computation- 
ally tractable for routine use in the GWAS setting, provided the number of 
samples is sufficient compared to the numbers of phenotype and genotype 
variables tested. Regression models with latent confounding factors show 
promising performance when the number of samples is small compared to 
the dimensionality of the data. 
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1 Introduction 



The increasingly widely collected 'omics' data, including genomics, transcrip- 
tomics, metabolomics, and proteomics data sets, bring completely new opportu- 
nities to genome-wide association studies (GWASs). GWAS searches for associa- 
tions between the genome, typically represented as the single nucleotide polymor- 
phisms (SNPs), and one phenotype variable, or trait. Examples of traits could be a 
disease indicator (binary, dichotomous) or the height of an individual (continuous- 
valued). The statistical challenge of the GWASs is the required multiple testing 
correction to account for the large number of associations (> 10 6 ) to be tested, 
see, e.g., Balding (2006). There is now a growing interest in detecting associa- 
tions between genomics and the other types of omics data, where the phenotype 
is multivariate in contrast to the classical setting. For example, associations be- 
tween genotypes and gene expressions (transcriptomics) have been studied, see, 
e.g., Parts et al. (201 1) and Fusi et al. (2012); more recently, the genotypes have 
been associated with metabolomics phenotypes (Tukiainen et al., 2012; Inouye et 
al., 2012). With such studies, the problems related to the multiple testing become 
accentuated due to the growing dimensionality of the phenotype vector. 

To tackle the statistical challenge in the omics-omics type association studies, 
several alternatives have been proposed. The simplest and most commonly used 
approach is to test associations between each genotype-phenotype pair in turn, and 
then to apply a stringent significance cutoff to account for the vast number of tests 
performed. For high-dimensional phenotypes this approach requires a very large 
sample size, however, as it fails to utilize the fact that a SNP affecting a phenotype 
variable is likely to have an effect also on other phenotypic variables which are 
highly correlated with the first one. The most straightforward modification to the 
simple pairwise testing is a combination of several pairwise tests, at the simplest 
by taking an average, for example. 

However, it has been argued by Kim and Xing (2009a) that accounting for the 
correlations between multiple phenotypes while testing for association is prefer- 
able to combining results from several related experiments in an ad hoc man- 
ner after the pairwise testing. Kim and Xing (2009a,b) used sparse regression 
models for multiple correlated traits. The models favor sparsity in the regres- 
sion coefficients while encouraging sharing of common regressors for correlated 
traits. Another class of statistical models that has been proposed for GWASs of 
high-dimensional phenotypes exploits latent variables to account for hidden con- 
founders that, if unaccounted, would blur the analysis by causing false positive 
findings and reduced power. These latent variable regression models have earlier 
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been successfully applied in association studies of gene expression measurements 
by Stegle et al. (2010) and Fusi et al. (2012). Canonical correlation analysis 
(CCA) is yet another multivariate technique that has recently been suggested as a 
tool for analyzing high-dimensional phenotypes in genome- wide association stud- 
ies (Ferreira and Purcell, 2009). CCA is a generalization of multivariate regres- 
sion, where, instead of testing for an association between a pair of variables, an 
association between two groups of variables is tested (Hotelling, 1936). CCA is 
conceptually related to the aforementioned latent variable regression models, as 
it can be cast into a probabilistic formulation according to which the so-called 
canonical correlation between the two groups of variables is explained by hidden 
factors affecting simultaneously both sets of variables (Bach and Jordan, 2005). 

Given the list of methods, a researcher planning to carry out a GWAS with 
a high-dimensional phenotype immediately faces the challenge of choosing the 
method most appropriate for her data. The challenge is particularly hard as the 
methods have been developed and presented independently, and partly even with- 
out acknowledging each other. Thus, definite guidelines, let alone an established 
strategy on how to perform a statistical analysis in a GWAS with a high-dimensional 
phenotype are clearly lacking. Our goal in this paper is to investigate the suitabil- 
ity of various alternatives for GWAS with high-dimensional phenotypes. As the 
case study we use a recently published metabolomics data set of 137 quantita- 
tive traits with a genome-wide genotype data comprising approximately half a 
milloin single nucleotide polymorphisms sampled from 509 unrelated individuals 
(Inouye et al., 2010). These data are particularly suitable for our purpose because 
the ground truth is available as the data have earlier been analysed for associations 
as a part of a larger data set (Tukiainen et al., 2012; Inouye et al., 2012). 

The genome is inherited as continuous chunks of DNA, resulting in high cor- 
relations (linkage disequilibrium) between neighboring SNPs, see, e.g., Frazer et 
al. (2007). To make the methods computationally feasible in practice, instead of 
analyzing all SNPs together, we reduce the dimensionality by exploiting this char- 
acteristic of genotype data by dividing the genome into blocks of correlated SNPs 
(referred to as LD-blocks in the remainder of this paper) which we analyse sepa- 
rately (and in parallel) from each other. While joint analysis of the whole-genome 
genotype vector could be more accurate in principle, this blockwise approach is 
favourable in two respects: first, the number of tests is reduced, and second, the 
methods will be able to borrow information not only over correlated phenotypes, 
but also over several neighboring SNPs (for example if the causal SNP happens 
not to be present in the data set). The LD-block structure has previously been uti- 
lized in association studies of univariate traits (see discussion by Balding, 2006) 
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and also to impute missing genotypes (Marchini and Howie, 2010). 

To summarize, we compare a set of recently developed sophisticated methods 
for realistic- sized GWAS with high-dimensional phenotypes. Based on experi- 
ences from the analysis of the real data along with comprehensive simulations, 
we provide practical suggestions on how data sets with similar characteristics can 
most appropriately be analyzed. The structure of this article is the following. In 
the next section, Methods, we outline the statistical methods included in our study, 
describe the metabolomics data, and summarize the procedure for generating sim- 
ulated test data sets. In the Results section, we present results from analyses of 
real data and two different simulation scenarios. We also provide a graphical ex- 
planation to some characteristics of CCA earlier reported as empirical findings by 
others. In the Discussion section we summarize our main findings, discuss open 
questions and suggest directions for future research. 

2 Methods 

The main focus of this paper is to investigate how well different methods are able 
to take advantage of a joint analysis of all phenotypes in an association study. 
Here, we briefly outline the methods that we include in our study. We utilize the 
LD-block structure to enable the methods to account for information over several 
correlated SNPs. Thus, the association test scores provided by different methods 
are for LD-blocks, not individual SNPs, the idea being first to detect associated 
regions, followed by investigation of SNP-wise weights provided by the methods^ 
Consequently, if only SNP-wise scores are available from some method, these are 
combined into a single score in a straightforward manner, for example by taking 
the maximum or average over the block, as explained in more detail below. 

2.1 Canonical correlation analysis 

Canonical correlation analysis (CCA) is a multivariate technique designed for de- 
tecting associations between two groups of variables (Hotelling, 1936; see also 
Mardia, 1979; Hardoon et al., 2004). Letting X and Y denote the n x q and n x p 
genotype and phenotype matrices and assuming without loss of generality that 
they are centered, the goal of CCA is to find a linear combination of the columns 

'We note the argument of Donnelly (2008) that even fine mapping of candidate regions is 
unlikely to point to just one potentially causal SNP and, instead, will typically narrow researchers' 
attention to a set of such SNPs to be studied further in functional assays. 



5 



of Y and a linear combination of the columns of X that are maximally correlated 
with each other. This corresponds to finding vectors a6R ? and b E R p such that 



becomes maximized. In CCA terminology, Xa is called the best linear predictor 
and Yb the most predictable criterion, although the underlying mathematics is 
symmetric. 

Denoting by S x and S y the sample covariance matrices and Sxy the cross- 
covariance matrix, the procedure for finding a and b starts by computing 

K = S x ^ SxySy ^ (2) 

and 

Ni = KK', N 2 = K'K. (3) 
Then, by the singular value decomposition theorem, K can be written as 

K=(cc h ...,a k )D(p h ...,p k y, (4) 

1/2 1/2 

where a, and /3,; are the standardized eigenvectors of Ni and N2, D = diag(Aj , . . . , X k ) 
is a diagonal matrix of non-zero eigenvalues of A^i (or ./V2), and k = rank(K). Now, 

m = 1/2 and bi = Sy l/2 fr (5) 

are termed the zth canonical correlation vectors for X and Y, respectively. The 
objective (QQ) is maximized by selecting a = a\ and b = b\. Furthermore, a 2 and 
&2 are the coefficient vectors that maximize the correlation under the constraint 
that the resulting linear combinations are uncorrelated with Xa and Yb, and so 
forth. 

Canonical correlation analysis has recently been investigated in the context of 
genome-wide association studies (Ferreira and Purcell, 2009; Naylor et al., 2010; 
Tang and Ferreira, 2012). The difference between these articles is in how the 
problem is formulated in order to apply the CCA. Basically, one is left with the 
freedom of choosing the groups of variables between which associations are in- 
vestigated. Naylor et al. (2010) divided their gene expression measurements into 
groups of three consecutive probes which were tested for association with SNPs 
located in the corresponding genomic region. Tang and Ferreira (2012) tested 
genes but pruned highly correlated SNPs to define the groups of SNPs to be tested 
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for association with a relatively low-dimensional phenotype (< 6). Recently, In- 
ouye et al. (2012) tested each SNP individually with groups of phenotypes, where 
the phenotype groups corresponded to clusters of highly correlated metabolites. 
In this paper, we consider two alternative strategies for defining the SNP groups: 
(1) selecting blocks of neighboring SNPs which are highly correlated due to link- 
age disequilibrium (the method being referred to as CCA-block henceforth), or (2) 
analysing each SNP in the block individually and taking the maximum of the in- 
dividual scores as the score of the block (CCA-single). As the phenotype, we use 
all the metabolites jointly; in the simulations we additionally consider analysing 
only a subset of highly correlated metabolites at a time. 

Two different approximations for determining the statistical significance of 
whether any of the k canonical correlations p,- is non-zero have been utilized in 
genetic association studies. The first (see, e.g., Bartlett, 1941) uses Bartlett's ap- 
proximation 

X 2 pq ~ -(ii- 1 - (p + q + l)/2)lnn(l -P?). (6) 

(=1 

The second one uses Wilks's lambda A = Ilf=i ( 1 — pf) an d Rao's F-approximation: 

1-X l / S \ fdf : 



where 



F VM*) = \-^) x {df l )' (7) 



V-4 



p q 



p 2 + q 2 — 5 



dfi=pq, (8) 



and 



f ln-3-p-q \ pq 

df2 ={ 2 ) S -Y + L (9) 

Unless stated otherwise, the score for CCA-block is taken to be the value of the 
maximum canonical correlation. Alternatively one could use (minus logarithm 
of) the p-value calculated for the block using either Bartlett's or Rao's approxima- 
tions. However, we present some comparisons of these three possible approaches. 
Note that with CCA-single only one canonical correlation may be calculated, giv- 
ing a one-to-one mapping between the (maximum) canonical correlation and the 
corresponding p- values. 
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2.2 Sparse regression for multiple correlated traits 



In the machine learning community, several regularized regression methods have 
recently been introduced for modeling correlated phenotypes, see, for instance, 
Kim and Xing (2009a,b) and Sohn and Kim (2012). In these methods, the columns 
y k of the phenotype matrix Y are often modeled using separate regression models 

Vk = X/3 k + £k, k=l,...,p. (10) 

One popular method, GFlasso (Kim and Xing, 2009a), facilitates borrowing of 
information between correlated phenotypes through learning the parameters with 
penalized least square^: 

B = argmin £(y k - X/3 k ) T (y k - X/3 k ) + 

k (11) 
A-EElifyjfcl+r £ d/£l£>-sign(r ra/ )/3 ;7 |, 

j k (m,l)€E j 

where B contains jointly the estimated parameter vectors /3 k . In (fTTI) . there are 
two regularization parameters, A and 7, to be learned through cross-validation. 
The purpose of the parameter A is to shrink the coefficients towards zero, favoring 
models with few non-zero coefficients. The term including the 7 parameter has 
been included to favor sharing information; it encourages the sizes of the effects 
jB/ m and fiji of SNP j on correlated phenotypes m and I to be similar. In Equation 
(fTTI) . r m i is the correlation between the /nth and Zth phenotypes and E is an a priori 
specified phenotype correlation graph with edges representing correlations to be 
accounted for in the model. 



2.3 Regression with hidden confounding factors 

High-dimensional phenotypes are often correlated due to hidden confounders not 
related to genetic factors. For example, gene expression measurements may be af- 
fected by environmental conditions and experimental procedures (Leek and Storey, 
2007; Gibson, 2008), which, unaccounted, would cause reduced power and in- 
creased false positive rate in association studies. To handle such confounders, 
several methods where the hidden determinants are explicitly modeled (Stegle et 
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This particular formulation is termed G^.Flasso by Kim and Xingl d2009ah . We used cutoff 0.7 



in our analyses to define the edges E in the correlation graph. 
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al., 2010; Parts et al., 2011; Fusi et al., 2012) have been presented. The basic 
model has the form: 

Y = [i+SV + XW + e, (12) 

where /i is a vector of phenotype-specific mean terms, X and S denote the ob- 
served genotypes and hidden confounders, respectively, with the corresponding 
regression coefficients collected into the matrices W and V. The £ is a matrix 
comprising Gaussian i.i.d. noise terms. 

The methods differ in the way the model specified in Equation (fT2)) is learned. 
The most thorough way (Fusi et al., 2012) is to jointly learn the hidden factors 
and SNPs that influence the phenotype. Here, to facilitate straightforward parallel 
processing of the genotype blocks, we utilize an approximation implemented in 
PEER software (Stegle et al., 2010), where the hidden factors are first learned 
independently of the genotypes and their effects on the phenotypes are removed. 
The resulting residuals are then used in the place of the phenotypes to test for 
associations with the genotypes using univariate methods described below. 

2.4 Univariate association testing and principal component anal- 
ysis 

As the baseline we use two methods; the first is generally used in univariate asso- 
ciation analyses, and the second is a straightforward multivariate method. We use 
linear regression models to test for assocation between genotype Xj and phenotype 
y u i= l,...,/?and j= l,...,q: 

yi = Po+PiXj + £i. (13) 

The score for an LD-block is taken to be either the smallest p-value for the /3i 
coefficient in any of the genotype-phenotype pairs tested (referred to as best-pair), 
or the average of the corresponding t-test scores over all genotype-phenotype pairs 
(avg-pair). As the second simple baseline method we calculated as many principal 
components for both genotypes and phenotypes as needed to explain at least 99.5 
percent of the variation in each data set. Then we tested for associations between 
the principal components using the univariate test described above. 

2.5 Metabolomics data set 

As the real test case, we used a data set published by Inouye et al. (2010) which 
consists of genome-wide SNP data along with metabolomics measurements (for 
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details concerning metabolomics data collection, see Soininen et al., 2009, and 
Kettunen et al., 2012). SNPs with low minor allele frequency (<0.02) and de- 
viation from Hardy-Weinberg equilibrium (p < 0.00001) were removed as a pre- 
processing step, leaving approximately 550,000 SNPs from 568 unrelated individ- 
uals. The metabolomics data set comprised 137 metabolites, most of which repre- 
sented NMR quantified levels of lipoproteins classified into 4 subclasses (VLDL, 
IDL, LDL, HDL), together with quantified levels of amino acids, some serum ex- 
tracts and a set of quantities derived as ratios of the aformentioned metabolites. 
The final sample size was 509 individuals having both data types. Effects of age, 
sex, and lipid lowering medication were regressed from the metabolomics data as 
a pre-processing step (with the latent variable regression approach this was done 
jointly with removing the effects of the hidden confounders). The correlation 
matrix of the metabolomics data is shown in FigureQ] A distinguishing character- 
istic of the data is the blocklike structure composed of groups of highly correlated 
metabolites. 

We defined the genotype block structure by setting block boundaries at loci 
where adjacent SNPs were more than 0.01 cM apart using genetic map from 
HapMap Release 22 (NCBI 36) (Frazer et al., 2007). This resulted in 68,124 LD- 
blocks with sizes ranging from 1 (32 percent of the blocks comprising 4 percent of 
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the SNPs) to 426 SNPs. Based on graphical inspection, the blocks defined in this 
way seemed to capture the block-diagonal correlation structure of the genotype 
data reasonably accurately, although we acknowledge the inherent arbitrariness 
related to any single fixed cutoff value. By visual comparison, similar divisions 
could be obtained with the Haploview software (Barrett, 2005). 

2.6 Simulated data 

As the basis of our simulations we used two randomly selected LD-blocks from 
the real genotype data. The genotypes for the simulated data sets were created by 
sampling with replacement from the set of all available genotypes in these regions. 
A single SNP was used to generate the effects using a linear model. Afterwards, 
this causal variant was removed from the genotype data leaving a network of pos- 
sibly non-linear relationships between the remaining SNPs and phenotypes. This 
corresponds to the scenario that the true causal variant has not been included in 
the data set. After generating the effects, the empirical correlation matrix of the 
metabolomics data was used to simulate correlated additive multivariate Gaussian- 
distributed noise on top of the simulated phenotypes. 

The following factors were varied in the experiments: (1) The size of the LD- 
block was either 6 or 22 before removing the causal variant. (2) The correlation 
between the causal variant and the closest SNP in the data set was fixed by man- 
ually selecting a SNP from data that had the desired correlation with the causal 
variant. (3) As the noise correlation matrix we used either the whole metabolomics 
matrix (137 features) or submatrices corresponding to IDL (6 features) or VLDL 
(31 features) metabolite subgroups. (4) When the full 137 phenotype features 
were simulated, the affected traits were selected by mimicking the effects ob- 
served in real data (see Figure 1 in Tukiainen et al., 2012) such that the total 
number of affected traits was 23. These traits were selected such that they corre- 
sponded to three different groups of correlated traits, effects on one of the groups 
having a different sign from the other two. With smaller numbers of simulated 
phenotypes, the affected traits were selected analogously such that they corre- 
sponded to some correlated subclass of the real metabolites. (5) The effect sizes 
were drawn randomly from the interval [0.75/3 max ,/3 mflx ], where /3 max is the value 
reported in Results. Note that actual effects are smaller as the causal variant is not 
present in the genotype data. Consequently, an upper bound for the proportion of 
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Rankings of LD-blocks with known causal variants in the metabolomics data 
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LD-block 



Figure 2: Rankings of LD-blocks with known causal variants. The rankings were 
scaled linearly to the interval [0, 1] such that the block with the highest score got 
rank 1 and the block with the lowest score got rank 0. The total number of blocks 
was 68,124. The blocks are ordered according to the best-pair score 



variance explained (PVE) by an effect can be obtained from 

pvf - ALx Var (*) n4 ^ 

AL.VarW + l' 1 } 

where Var(x) is the variance of the causal variant, which in our simulations was 
always < 0.5. Thus, if j8 max << 1, the PVE can be roughly bound from above by 
5/3 2 



3 Results 
3.1 Real data 

We compared the methods using as the ground truth a set of 31 SNPs reported by 
Tukiainen et al. (2012), compactly summarized in Table 13 of Tukiainen (2012jj. 
These findings have been obtained using standard linear regression with a data set 
of 8,330 individuals (data in our study is a subset of these data). 

We first checked whether the methods were able to find the ground truth LD- 
blocks with significant scores. The multiple-testing corrected thresholds were 

3 During the writing of this article a study using CCA for one SNP at a time was published by 
Inouye et al. (2012). We did not include findings of that article in our baseline, in order not to bias 
the results in favor of CCA. 
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Table 1: Comparison of the methods with the real data. First five columns show 
summary of the rankings of 31 LD-blocks containing known causal variants. The 
columns are interpreted as follows: best: the number of times the method gave 
the highest ranking to a block with a known causal variant, > 0.95: the num- 
ber of blocks with known causal variants ranked among the top 5 percent of all 
blocks, max/mean: the maximum/mean rankings of the blocks with known causal 
variants, sign: the number of significant findings. The last column, time, shows 
the computation times for a single run (i.e., not including permutation testing). 
The number of significant findings for GFlasso is not available due to extensive 
computation time required by the permutation sampling. 





best 


>0.95 


max 


mean 


sign 


time 


CCA-single 


9 


9 


1.000 


0.814 


1 


7h 


CCA-block 


6 


6 


0.998 


0.806 





lh 


CCA-block (p-val) 


1 


2 


0.994 


0.358 





lh 


PEER 


2 


5 


1.000 


0.742 


1 


20h 


best-pair 


3 


8 


1.000 


0.743 





20h 


PCA 


7 


8 


0.994 


0.798 





llh 


GFlasso 


3 


3 


0.969 


0.487 


NA 


2,200h 



obtained by considering maximum scores in 100 repeated analyses with permuted 
data sets (except for GFlasso due to extensive computation time). With the lim- 
ited amount of data, only one of the LD-blocks with causal variants scored signif- 
icantly after the multiple-testing correction. The significant scores were given by 
methods CCA-single and PEER. To further examine how well the different meth- 
ods are able to highlight the LD-blocks with known causal SNPs, we ordered all 
LD-blocks using the scores from the methods. The rankings of the blocks with 
known causal SNPs are shown in Figure [2] with a summary given in Table [Q De- 
tailed listing of the rankings and the actual scores are given in Supplementary 
Tables 1 and 2@ The method that most often ranks the causal LD-blocks highest 
is CCA-single, with PCA and CCA-block the closest runner-ups. It is notable that 
these are precisely the methods that test for an association using all phenotypes 
jointly. Investigating Figure [2] more closely shows that all of these methods are 
capable of giving high scores to even those variants which are completely missed 

4 To keep the results uncluttered, the rankings for avg-pair are shown only in the Supplementary 
Tables 1 and 2. 
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by the univariate analysis. 

Using the p-values from Rao's approximation to rank the LD-blocks (CCA- 
block (p-val) in Table [B seems to work badly compared to using directly the 
maximum canonical correlation (CCA-block). Indeed, when we investigated the 
p-values more closely, many of them were very close to unity, indicating that the 
distribution of the test scores does not match with the assumptions underlying the 
significance test. We believe the main reason for this is the non-Gaussianity of 
the genotype data. The downside of using the maximum canonical correlation is 
that larger blocks are a priori more likely to obtain high canonical correlations, 
leading to reduced power with smaller blocks. Indeed, the causal blocks with only 
a few SNPs were ranked higher by CCA-single than with CCA-block (exact results 
not shown). 

Figure |3] shows detailed results from CCA-single and PEER for the LD-block 
that obtained significant scores by these methods. The SNP-wise weights from 
CCA-block are added for comparison. The SNP scores from PEER are clearly 
larger than those of CCA-single. This is due to the fact that CCA-single gives a 
single p-value jointly for all metabolites, while PEER uses the smallest p-value 
over the metabolites. For the same reason, PEER gives in general high scores to 
a larger number of SNPs than CCA-single. Although with the LD-block shown 
in the Figure |3] the scores from PEER and CCA-single are highly correlated (pre- 
sumably due to the strong associations between the SNPs and the metabolites), 
this did not happen in general with LD-blocks with known causal SNPs. We com- 
pared the number of SNPs with scores greater than the average of maximum and 
minimum scores in a block, and found out that in 19/31 LD-blocks this num- 
ber of high- scoring SNPs was higher with PEER than with CCA-single, and with 
9/31 blocks the number was higher with CCA-single (three draws). With CCA- 
block even fewer SNPs were pointed out by the scores, such that in 17/31 causal 
blocks the number of high-scoring SNPs was higher with CCA-single than with 
CCA-block, and only 5/31 in the other way. This indicates that when not only the 
metabolites, but also the SNPs are tested jointly in CCA-block, only those SNPs 
from a group of correlated SNPs get high scores that are most associated with the 
phenotype traits. 

Finally, Supplementary Table 1 shows SNP-wise rankings relative to all 550,000 
SNPs for 8 causal SNPs (as opposed to blocks) that were included in our data set. 
Although these rankings are not directly comparable to the block rankings, it is 
notable that the SNP-wise CCA utilized by Inouye et al. (2012) gives for 7/8 
SNPs lower rankings than the blockwise CCA formulations for the corresponding 
blocks. 
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Figure 3: Detailed results for the LD-block significantly associated with the 
metabolomics profiles. The location of the known causal lead variant (SNP 
rsl74547) is denoted by the red vertical line; however, this SNP was not present 
in our data set. Other SNPs with reported associations with metabolite traits (from 
dbSNP, Sherry et al., 2001) are shown with blue vertical lines (solid if SNP was 
included in our data, dotted otherwise). The scores for SNPs from PEER and 
CCA-single are obtained by considering the minus logarithm of the correspond- 
ing p-values (the score of the whole block given by the maximum of these). The 
absolute values of the canonical weights (scaled to have the maximum equal to 
that of CCA-single) are shown as the scores with CCA-block. 

Table \T\ shows the computation times of a single run with the different meth- 
ods. We note that the running times are highly dependent on the underlying im- 
plementation. For CCA, PCA and pairwise methods we used standard functions 
available in the R software package. The running time of PEER is in practice 
equal to best-pair with only a few extra minutes added by learning the hidden 
factors and removing their effects in the beginning. 

3.2 Simulations 

We investigated the power of the methods to detect associations in two different 
simulation setupsJl Throughout, we used data sets simulated with effect size set 
to zero to determine significance thresholds yielding the empirical false positive 
rate equal to 0.05. 

5 The results for avg-pair are not shown. However, best-pair was found to perform better than 
avg-pair in our experiments. 
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n=500, p=0.84 n=150, p=0.84 n=500, p=0.24 n=150, p=0.24 




1.S 0.4 0.1 0.025 O.OOS 1.6 0.4 0.1 0.025 0.006 1.6 0.4 0.1 0.025 0.006 1.6 0.4 0.1 0.025 0.006 



Effect size Effect size Effect size Effect size 

Figure 4: Power of different methods in the whole metabolomics profile simulation 
scenario. Factors that were varied include n, the number of individuals, p, the 
largest correlation between the (not included) causal variant and any of the SNPs 
included in the data set, and the effect size. The value of the effect size shown 
in the figure is relative to one standard deviation of noise. Corresponding to each 
parameter configuration, 200 data sets with the specified effect and 200 data sets 
with zero effect were analysed. The empirical false positive rate was fixed to 0.05 
by selecting an appropriate significance threshold using the data sets simulated 
under no effect. In the panel on the left the curve for CCA-single is behind CCA- 
block. 



3.2.1 Whole metabolomics profile simulation 

In the first setup, we investigated how the size of the effect, the number samples, 
and the correlation between the causal variant and the closest observed SNP affect 
the power of the different methods to detect associations. We fixed the dimension 
of the phenotype vector to 137 (using the whole metabolomics correlation matrix 
to generate the noise). The results are summarized in Figure HI The following 
conclusions can be drawn: 

1 . CCA is the best method when the number of number of samples is large 
enough relative to the dimension of the genotype and phenotype blocks 
tested, but breaks down otherwise. The difference between the CCA-single 
and CCA-block is intuitive: if none of the SNPs present is highly correlated 
with the causal variant, the CCA-block is capable of better utilizing the in- 
formation in the whole genotype block, outperforming CCA-single. On the 
other hand, when some observed SNP is highly correlated with a (single) 
causal variant, CCA-block has no advantage over CCA-single. 

2. PEER is among top-three methods in all setups and seems to be the method 
of choice when the number of observations is small compared to the dimen- 
sion, that is, the realm for which it was originally developed. The power of 
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PEER decreases as the effect size gets large enough. This behavior is ex- 
pected and can be explained by the fact that PEER starts explaining away the 
true effect with latent confounders. A possible solution suggested by Stegle 
et al. (2010) would be to iterate between learning the effects and latent con- 
founders. Alternatively, joint learning of the effects and confounders (Fusi 
et al., 2012) would likely improve the results in this respect. 

3. GFlasso did not work well in our simulation setup. We hypothesize that the 
reason may be that the Lasso type regularization ceases to work reasonably 
when the number of SNPs is too small and the SNPs represent a single block 
with relatively high inter-correlations. Furthermore, the hyperparameters 
learned by cross-validation were different for different data sets, which may 
affect the ranking of the data sets by the maximum regression coefficient. 
Some kind of pooling to learn a single common hyperparameter would seem 
reasonable and will be worth studying later. Discouraged by these results, 
we did not include GFlasso in the second simulation setup (see below) to 
save some computation time. 

4. PCA performs better than the other methods when no single SNP with high 
correlation with the causal variant is present in the data set, and sample 
size is small relative to the dimension of data (the fourth panel in Figure 
IU). However, in this setup, even PCA seems to require unrealistically large 
effects sizes before satisfactory behaviour can be expected. 

3.2.2 Metabolite subgroup simulation 

In the second setup, we investigated the effect of residual (i.e. noise) correlation, 
number of features in the genotype and phenotype data sets, and the number of 
affected traits on the power to detect associations. As the basis of simulating the 
phenotypes we used the empirical correlation matrix of IDL (6 features) or VLDL 
(31 features) metabolite subgroups. The results from this setup are summarized 
in Figure [5] 

The most obvious trend is the improved performance of CCA when the resid- 
ual correlation increases. Some intuition to this behavior can be obtained by in- 
terpreting the n observations for variables v, and xj as vectors in an rc-dimensional 
space, and noticing that the correlation between any two variables equals the co- 
sine of the angle between the correponding vectors (see, e.g., Mardia, 1979). In 
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SNPs=21, phe:10/31, p=0.84 SNPs=21, phe:10/31, p=0.24 



SNPs=5, phe:1/6, p=0.93 




SNPs=21, phe:31/31, p=0.84 SNPs=21, phe:31/31, p=0.24 SNPs=5, phe:6/6, p=0.93 




Residual correlation Residual correlation Residual correlation 



Figure 5 : Power of different methods in the metabolite subgroup simulation sce- 
nario. The titles of the panels show the number of SNPs, and the maximum corre- 
lation between the causal variant and the most correlated SNP present in data, p. 
Text phe:x/y in the title tells the number of phenotypic traits (y) and the number 
of traits affected by the causal variant (x). High residual correlation corresponds 
to using the empirical correlation matrix of a specific group of highly correlated 
metabolites to simulate the noise. The less correlated data sets are created by 
raising the correlation matrix to powers 10, 20, 40 and 80, effectively pulling the 
off-diagonal elements of the correlation matrix towards zero. Corresponding to 
each parameter configuration, 400 data sets with the specified effect and 400 data 
sets with zero effect were analyzed. The empirical false positive rate was fixed to 
0.05 using the data sets simulated under no effect. PEER converged in less than 
5 percent of the data sets with 6 phenotypes (the third column), and is not shown 
for these data sets. 

particular, the canonical correlation defined in Equation (OQ) is the cosine of the an- 
gle between linear combinations Xa and Yb. Thus, canonical correlation analysis 
attempts to minimize the angle between the linear combinations. Figure [6] illus- 
trates this in a simplified situation with two phenotypes and a single genotype. 
One of the phenotypes, PI, is correlated with the genotype, that is, it has a com- 
ponent that points to the same direction as the genotype. The other phenotype, P2, 
falls on the plane perpendicular to the genotype, and is therefore uncorrelated with 
the genotype. When PI and P2 are correlated (left panel), P2 can be used to can- 
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Figure 6: Graphical illustration for why CCA works better with correlated data. 
The green vectors PI and P2 represent two phenotypes in n-dimensional space. 
The blue vector represents the genotype. Note that PI has a component that points 
to the same direction as the genotype. The red arrow shows the linear combination 
of the phenotypes that has the minimum angle with the genotype. In the left panel 
PI and P2 are highly correlated, i.e. pointing roughly to the same direction, in the 
right panel, the correlation between PI and P2 is small. 

eel the component of PI that is perpendicular to the genotype, leading to higher 
canonical correlation. In the other extreme PI and P2 would be completely un- 
corrected (perpendicular to each other), in which case P2 would be of no use in 
making the angle between PI and the genotype smaller. Thus, better use can be 
made of the component pointing to the same direction with the genotype if the 
phenotypes are correlated, which explains the higher canonical correlations with 
correlated phenotypes. 

Similar reasoning can be used to explain the behavior noticed by others (Fer- 
reira and Purcell, 2009; Tang and Ferreira, 2012; Waaijenborg et al., 2008) that the 
power of CCA decreases if the genotype affects simultaneously all phenotypes. If 
all phenotypes are positively correlated and have an equal component to the di- 
rection of the genotype, it is easy to visualize that that component is removed 
simultaneously when subtracting the component perpendicular to the genotype. 
In our simulations this behavior was not prominent. The explanation is that, al- 
though in our simulations the causal SNP affected a group of correlated SNPs, 
the exact effect sizes were selected randomly from a specific interval, thus not 
canceling each other completely when taking linear combinations. 
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4 Discussion 



In this work, we have investigated methods that can be used to detect small effects 
in GWASs with high-dimensional phenotypes, by taking the whole phenotype 
vector jointly into account. Our main conclusion, supported by both simulations 
and analysis of a real data set, is that canonical correlation analysis appears to be 
the most powerful approach for this purpose. On the other hand, if the number 
of samples is reduced to the level of the dimension of the genotype or phenotype 
group to be tested, regression models with latent confounders (such as imple- 
mented in PEER) seem promising. Furthermore, with the real data, PEER and 
CCA were the only methods in our study with which any of the known causal 
variants got significant scores after multiple-testing correction. 

We allowed the methods to combine information in the genotype data in a 
computationally feasible way, by dividing the genome into blocks of correlated 
SNPs. Compared to analysing the whole genome jointly, the possibility to process 
the blocks in parallel makes the methods considered in this study computationally 
feasible even with permutation sampling to obtain multiple-testing corrected sig- 
nificance thresholds. Even if CCA analysis using jointly the full set of genotypes 
was possible, the interpretation of the canonical components might be tedious, 
as discussed e.g. by Waaijenborg (2008). Considering a block of neighboring 
SNPs at a time focuses the putative effect on certain part of the genome making 
the interpretation easier. Further, picking the SNP with the largest coefficient in 
the canonical correlation vector seems a promising way of recovering the SNPs 
most correlated with the phenotypes, the strategy also suggested by Naylor et al. 
(2010). On the other hand, compared to analysing each SNP separately, the block- 
wise approach reduces the dimensionality of the problem, lessening the multiple- 
testing problems. In our simulations, we saw increased power to detect causal 
variants that were not included in the data set when the whole block of genotypes 
was tested jointly. Further, with the real data the blocks with causal variants were 
ranked higher by CCA (relative to all blocks) than the actual causal SNPs (relative 
to all SNPs). 

When testing for association using CCA between a block of genotypes and 
phenotypes (CCA-block), we found out that the p-values obtained with Rao's ap- 
proximation did not behave very well, as the values were often very close to unity. 
We obtained similar results also with Bartlett's approximation (exact results not 
shown). Therefore, we used the maximum canonical correlation as the test score. 
A comparison of the three alternatives with the whole metabolomics profile sim- 
ulation setup is shown in Supplementary Figure 1 . However, using the maximum 



20 



canonical correlation itself as the test score has the downside that larger blocks 
are a priori more likely to obtain high canonical correlations, leading to reduced 
power when testing with smaller blocks. One suggestion to improve CCA in this 
respect would be to regress the effect of the block size from the CCA scores to 
make results from blocks with differing sizes better comparable; however, this 
requires further investigation. 

Several versions have recently been introduced in order to extend the usability 
of CCA to high-dimensional data sets under small n large p conditions, including 
regularized (Chen et al., 2012), Bayesian (Wang, 2007; Klami and Kaski, 2007; 
Virtanen et al, 2011) and kernel CCA (Hardoon et al., 2004). These methods 
often rely on computationally extensive techniques, such as cross-validation, to 
learn the hyperparameters of the model. Investigation of possible gains at the price 
of increased computational burden with realistic-sized GWAS data sets remains 
an open question. 

Supplementary Data 

The Supplementary Data referred to in the text is available as a single zip-file from 
http : //users . ics . aalto . f i/pemartti/high_dimensional_supplementary/ 
The file contains Supplementary Tables 1-2, Supplementary Figure 1, and the cap- 
tions to the supplementary Tables and Figures. 
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